Asian writing systems, such as that used in China, are substantially different from writing systems developed in other parts of the world. In the writing systems of most Western languages, characters represent sounds in spoken words. A relatively small set of characters can be arranged in many different combinations to represent the thousands of sound combinations used in speech.
In contrast, characters in the Chinese writing system typically do not represent individual sounds in spoken words. Rather, each character represents an idea or concept. Consequently, thousands of different characters have been developed in the Chinese writing system, corresponding to thousands of different concepts. In general, the Chinese writing system appears much more complex than those used in most Western countries because it comprises a much greater number of characters.
Further complicating the Chinese writing system, characters are combined into sentences with essentially no variation in spacing between characters. While a single Chinese character may correspond to an entire word, often two or more characters together correspond to a word. Hence, it can be difficult to distinguish individual words from one another in a sentence written in Chinese because there is little to indicate where one word ends and another begins, i.e., there is no spacing between words. Punctuation, such as periods or commas, can be relied on as a delimiter between words, as can words customarily written in English that appear in a sentence otherwise formed of Chinese characters. Frequently, though, there will be no delimiter between one word and the next within a sentence written in Chinese.
In this respect, Chinese can be more problematic than Japanese. The Japanese writing system initially appears more complex than Chinese, in that it employs three character sets: (1) kanji; (2) hiragana; and (3) katakana. In addition, some words in Japanese text are commonly written in English.
In Japanese, the kanji characters are based in substantial part on the Chinese writing system. Specifically, many kanji characters used in the Japanese writing system are similar or substantially identical to the Chinese characters representing corresponding concepts, although pronunciation is often completely different. In Japanese, therefore, as in Chinese, such characters typically do not represent individual sounds in spoken words. Hence, the Japanese writing system is also complex in that it is formed of thousands of different characters.
In written Japanese, as in Chinese, sentences have essentially no variation in spacing between the characters forming the sentence, i.e., there is no spacing between words. Nevertheless, it is usually easier to distinguish one word from another in a written Japanese sentence because of the other two Japanese character sets, hiragana and katakana.
Hiragana and katakana are both phonetic alphabets. Specifically, both employ a set of characters representing sounds in spoken words. Katakana is generally used in the Japanese writing system to spell words from foreign languages used in Japanese. Hiragana is used for, among other things, words of Japanese origin for which there is no kanji character, subject or object markers, indications of location (such as at, in, or by), possessive markers, and tense indicators. In a written Japanese sentence, hiragana and/or katakana characters often separate words in kanji characters from one another, thereby making it easier to distinguish one word from another relative to a comparable sentence in Chinese.
For example, to write “the child's dog” in Japanese, a hiragana character indicating possession appears between the kanji characters for “child” and “dog.” Thus, relative to Chinese, it is easier in Japanese to distinguish words from one another because characters from the Japanese phonetic character sets appear in sentences in the Japanese writing system.
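The way a change of character set marks a likely word boundary can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `script_of` and `split_by_script` are hypothetical, the Unicode ranges cover only the basic hiragana, katakana, and CJK Unified Ideographs blocks, and the phrase 子供の犬 is one possible rendering of “the child's dog” chosen for the example.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by Unicode block (basic blocks only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_by_script(text):
    """Group consecutive characters that belong to the same character set."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]


if __name__ == "__main__":
    # Hypothetical sample phrase: "the child's dog"
    for script, chunk in split_by_script("子供の犬"):
        print(script, chunk)
```

For the sample phrase, the grouping gives the kanji run 子供, the hiragana possessive marker の, and the kanji run 犬, so the two kanji words are separated by the hiragana character. A Chinese sentence offers no comparable change of character set, so the same grouping would return one undivided run.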
Difficulties have been encountered in developing information systems capable of accurately processing articles or text in Asian writing systems, such as Chinese, Japanese, and Korean. While these difficulties may be less severe for some Asian writing systems, such as Japanese, they have arisen with such writing systems in general.
One difficulty in particular has been in developing an information processing system capable of accurately distinguishing names of persons or organizations in Chinese from surrounding textual material. Such processing would be useful, for instance, for searching articles for keywords or pertinent phrases to locate articles relevant to a particular subject and/or for indexing articles for future document retrieval. For example, someone may wish to locate and/or index articles concerning a famous Chinese person. In addition, such processing would be useful for more accurate computer translation of Asian text into another language, such as English.
In Chinese and other languages, a person's name may have meaning when used in a context other than as a personal name. In English, for instance, “king” is a relatively common surname and also a noun meaning the head of state in countries with some form of monarchical government. Similarly, in Chinese the word “wang” (also commonly written “wong” in English) is a common surname, but it also means king or emperor. Accurate translation requires a system capable of reliably distinguishing whether such a word is used as a personal name or as a noun meaning something else.